Save a PDF file that's been opened in Internet Explorer with OLE and Perl

I am looking for a way to use Perl to open a PDF file in Internet Explorer and then save it.
(I want the user to be able to interact with the script and decide whether downloading occurs, which is why I want the PDF to be displayed in IE, so I cannot use something like LWP::Simple.)
As an example, this code loads (displays) a pdf, but I can't figure out how to get Perl to tell IE to save the file.
use Win32::OLE;
my $ie = Win32::OLE->new("InternetExplorer.Application");
$ie->{Visible} = 1;
Win32::OLE->WithEvents($ie);
$ie->Navigate('http://www.aeaweb.org/Annual_Meeting/pdfs/2014_Registration.pdf');
I think I might need to use the OLE method execWB, but I haven't been able to figure it out.

What you want to do is automate the Internet Explorer UI. There are many libraries out there that will do this. You tell the library to find your window of interest, and then you can send keystrokes or commands to the window (CTRL-S in your case).
A good overview on how to do this in Perl is located here.
Example syntax:
my @keys = ( "%{F}", "{RIGHT}", "E", );
for my $key (@keys) {
SendKeys( $key, $pause_between_keypress );
}
The code starts with an array containing the keypresses. Note the
format of the first three elements. The keypresses are: Alt+F, right
arrow, and E. With the application open, this navigates the menu in
order to open the editor.
Another option is to use LWP:
use LWP::Simple;
my $url = 'http://www.aeaweb.org/Annual_Meeting/pdfs/2014_Registration.pdf';
my $file = '2014_Registration.pdf';
getstore($url, $file);

For ExecWB, here is a good thread, although it is unresolved: http://www.perlmonks.org/?node_id=477361
$IE->ExecWB($OLECMDID_SAVEAS, $OLECMDEXECOPT_DONTPROMPTUSER,
$Target);
Why don't you display the PDF in IE then close the IE and save the file using LWP?

You could use Selenium and the perl remote drivers to manage IE
http://search.cpan.org/~aivaturi/Selenium-Remote-Driver-0.15/lib/Selenium/Remote/Driver.pm
http://docs.seleniumhq.org/projects/webdriver/
You will also need to download the IE Selenium driver separately; only the Firefox driver comes as standard.
https://code.google.com/p/selenium/wiki/InternetExplorerDriver
use Selenium::Remote::Driver;
my $driver = Selenium::Remote::Driver->new;
$driver->get('http://www.google.com');
print $driver->get_title();
$driver->quit();

Related

CGI/Perl script creating a customized signature file

I have this working somewhat.
I have a cgi file that has the following code:
#!/usr/bin/perl
use CGI;
$cgi = new CGI;
open (IMAGE, "ts.jpg");
$size = -s "ts.jpg";
read IMAGE, $data, $size;
close (IMAGE);
print $cgi->header(-type=>'image/jpeg'), $data;
exit;
This displays my image file correctly.
However, I want a user to be able to add 2 lines of text over the image through a web form to generate a new jpeg each time. Here is the URL: http://elearning.cpma.ca/signature.html
What am I missing in my cgi file that would allow me to re-publish to screen a new jpeg file with the 2 lines of text appearing on it when I click on the "Add Text" Button?
Any assistance would be really appreciated.
You'll need an image processing library. You'll see lots of recommendations for GD or ImageMagick, but I think I'd use Imager, as it's newer and a little easier to use.
A few general suggestions for improvements to your code.
Always use strict and warnings in your code.
Declare variables with my (my $cgi = ...).
The new CGI syntax is potentially problematic. Use CGI->new instead.
Use three-arg open and lexical filehandles (open my $image_fh, '<', 'ts.jpg').
Always check the return code from open (open my $image_fh, '<', 'ts.jpg') or die $!).
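As a rough illustration of the suggestions above, here is a minimal sketch of the file-reading part of the script. The helper name slurp_binary is made up for this example, and the binmode call is an extra precaution (not mentioned above) so the JPEG bytes are not altered on platforms that translate newlines.

```perl
use strict;
use warnings;

# Hypothetical helper: read a whole file in binary mode and return its bytes.
sub slurp_binary {
    my ($path) = @_;

    # Three-arg open with a lexical filehandle, checking the result.
    open my $fh, '<', $path or die "Cannot open $path: $!";
    binmode $fh;    # image data must not be newline-translated

    local $/;       # slurp mode: read to end of file in one go
    my $data = <$fh>;
    close $fh;
    return $data;
}
```

In the CGI script this would replace the open/read/close lines, for example: my $cgi = CGI->new; print $cgi->header(-type => 'image/jpeg'), slurp_binary('ts.jpg');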
I was able to create what I wanted by following a README provided by alvarotrigo on GitHub: TextPainter.
I was required to make some minor code changes in order to make it work properly - however, it was very easy to implement.
No CGI required. See SignatureFile for the final outcome of my image.
Thank you to all who responded to my issue.

Automatic Search Using WWW::Mechanize

I am trying to write a Perl script which will automatically key in search variables on this LexisNexis search page and retrieve the search results.
I am using the WWW::Mechanize module but I am not sure how to figure out the field name of the search bar itself. This is the script I have so far ->
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;
my $m = WWW::Mechanize->new();
my $url = "http://www.lexisnexis.com/hottopics/lnacademic/?verb=sr&csi=379740";
$m->get($url);
$m->form_name('f');
$m->field('q', 'Test');
my $response = $m->submit();
print $response->content();
However, I think the "Name" of the search box in this website is not "q". I am getting the following Error - "Can't call method "value" on an undefined value at site/lib/WWW/Mechanize.pm line 1442." Any help is much appreciated. Thank you !
If you disable JavaScript in your browser, you will notice that the search form doesn't load. The form is generated by JavaScript, which is why you can't handle it with WWW::Mechanize. Have a look at WWW::Mechanize::Firefox, which might help you with your task. Check out the example scripts, cookbook and FAQs.
You can also do the same using Selenium, see Gabor's tutorial on Selenium.

Downloads in Firefox using Perl WWW::Mechanize::Firefox

I have a list of URLs of pdf files that i want to download, from different sites.
In my firefox i have chosen the option to save PDF files directly to a particular folder.
My plan was to use WWW::Mechanize::Firefox in perl to download each file (in the list - one by one) using Firefox and renaming the file after download.
I used the following code to do it :
use WWW::Mechanize::Firefox;
use File::Copy;
# @list contains the list of links to pdf files
foreach $x (@list) {
my $mech = WWW::Mechanize::Firefox->new(autoclose => 1);
$mech->get($x); #This downloads the file using firefox in desired folder
opendir(DIR, "output/download");
@FILES = readdir(DIR);
my $old = "output/download/$FILES[2]";
move ($old, $new); # $new is the URL of the new filename
}
When I run the file, it opens the first link in Firefox and Firefox downloads the file to the desired directory. But after that, the new tab is not closed, the file does not get renamed, the code keeps running (as if it has hit an endless loop) and no further files get downloaded.
What is going on here? Why isn't the code working? How do I close the tab and make the code read all the files in the list? Is there an alternative way to download?
Solved the problem.
The function,
$mech->get()
waits for 'DOMContentLoaded' Firefox event to be fired by Firefox upon page load. As i had set Firefox to download the files automatically, there was no page being loaded. Thus, the 'DOMContentLoaded' event was never being fired. This led to pause in my code.
I set the function to not wait for the page to load by using the following option
$mech->get($x, synchronize => 0);
After this, I added a 60-second delay to allow Firefox to download the file before the code progresses
sleep 60;
Thus, my final code looks like
use WWW::Mechanize::Firefox;
use File::Copy;
# @list contains the list of links to pdf files
foreach $x (@list) {
my $mech = WWW::Mechanize::Firefox->new(autoclose => 1);
$mech->get($x, synchronize => 0);
sleep 60;
opendir(DIR, "output/download");
@FILES = readdir(DIR);
my $old = "output/download/$FILES[2]";
move ($old, $new); # $new is the URL of the new filename
}
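A fixed sleep 60 either wastes time on small files or is too short for large ones, and $FILES[2] relies on readdir returning "." and ".." first, which is not guaranteed. One possible alternative is to poll the download folder until a new, finished file shows up. The sketch below uses only core modules. The function names are invented for this example, and it assumes Firefox keeps a "name.part" file next to a download while it is still in progress.

```perl
use strict;
use warnings;
use Time::HiRes qw(sleep time);

# List plain files in $dir that look like finished downloads.
# "." and ".." are skipped because they are not plain files.
# Assumption: an in-progress download has a sibling "<name>.part" file.
sub finished_files {
    my ($dir) = @_;
    opendir my $dh, $dir or die "Cannot open $dir: $!";
    my @files = grep {
        -f "$dir/$_" && !/\.part\z/ && !-e "$dir/$_.part"
    } readdir $dh;
    closedir $dh;
    return sort @files;
}

# Poll $dir until a finished file appears that is not a key of %$seen,
# or until $timeout seconds have passed.
# Returns the new file name, or nothing on timeout.
sub wait_for_new_file {
    my ($dir, $seen, $timeout) = @_;
    my $deadline = time() + $timeout;
    while (time() < $deadline) {
        for my $file (finished_files($dir)) {
            return $file unless $seen->{$file};
        }
        sleep 0.5;    # fractional sleep provided by Time::HiRes
    }
    return;
}
```

In the loop above, one would record the existing files before the download with my %seen = map { $_ => 1 } finished_files('output/download'); and then replace the sleep and readdir lines with my $name = wait_for_new_file('output/download', \%seen, 120); before calling move.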
If i understood you correctly, you have the links to the actual pdf files.
In that case WWW::Mechanize is most likely easier than WWW::Mechanize::Firefox. In fact, i think that is almost always the case. Then again, watching the browser work is certainly cooler.
use strict;
use warnings;
use WWW::Mechanize;
# your code here
# loop
my $mech = WWW::Mechanize->new(); # Could (should?) be outside of the loop.
$mech->agent_alias("Linux Mozilla"); # Optionally pretend to be whatever you want.
$mech->get($link);
$mech->save_content("$new");
#end of the loop
If that is absolutely not what you wanted, my cover story will be that i did not want to break my 666 rep!

How can I screen-scrape output from telnet in Perl?

I can setup a telnet connection in Perl no problems, and have just discovered Curses, and am wondering if I can use the two together to scrape the output from the telnet session.
I can view on a row, column basis the contents of STDOUT using the simple script below:
use Curses;
my $win = new Curses;
$win->addstr(10, 10, 'foo');
$win->refresh;
my $thischar=$win->inch(10,10);
print "Char $thischar\n";
And using the below I can open a telnet connection and send \ receive commands with no problem:
use Net::Telnet;
my $telnet = new Net::Telnet (Timeout => 9999,);
$telnet->open($ipaddress) or die "telnet open failed\n";
$telnet->login($user,$pass);
my $output = $telnet->cmd("command string");
... But what I would really like to do is get the telnet response (which will include terminal control characters) and then search on a row \ column basis using curses. Does anyone know of a way I can connect the two together? It seems to me that curses can only operate on STDOUT
Curses does the opposite. It is a C library for optimising screen updates from a program writing to a terminal, originally designed to be used over a slow serial connection. It has no ability to scrape a layout from a sequence of control characters.
A better bet would be a terminal emulator that has an API with the ability to do this type of screen scraping. Off the top of my head I'm not sure if any Open-source terminal emulators do this, but there are certainly commercial ones available that can.
If you are interacting purely with plain-text commands and responses, you can use Expect to script that, otherwise, you can use Term::VT102, which lets you screen scrape (read specific parts of the screen, send text, handle events on scrolling, cursor movement, screen content changes, and others) applications using VT102 escape sequences for screen control (e.g., an application using the curses library).
You probably want something like Expect
use strict;
use warnings;
use Expect;
my $exp = Expect->spawn("telnet google.com 80");
$exp->expect(15, #timeout
[
qr/^Escape character.*$/,
sub {
$exp->send("GET / HTTP/1.0\n\n");
exp_continue;
}
]
);
You're looking for Term::VT102, which emulates a VT102 terminal (converting the terminal control characters back into a virtual screen state). There's an example showing how to use it with Net::Telnet in VT102/examples/telnet-usage.pl (the examples directory is inside the VT102 directory for some reason).
It's been about 7 years since I used this (the system I was automating switched to a web-based interface), but it used to work.
Or you could use the script command for this.
From the Solaris man-page:
DESCRIPTION
The script utility makes a record of everything printed on your screen. The record is written to filename. If no file name is given, the record is saved in the file typescript...
The script command forks and creates a sub-shell, according to the value of $SHELL, and records the text from this session. The script ends when the forked shell exits or when Control-d is typed.
I would vote also for the Expect answer. I had to do something similar from a gui'ish application. The trick (albeit tedious) to get around the control characters was to strip all the misc characters from the returned strings. It kind of depends on how messy the screen scrape ends up being.
Here is my function from that script as an example:
# Trim out the curses crap
sub trim {
my @out = @_;
for (@out) {
s/\x1b7//g;
s/\x1b8//g;
s/\x1b//g; # remove escapes
s/\W\w\W//g;
s/\[\d\d\;\d\dH//g;
s/\[\?25h//g;
s/\[\?25l//g;
s/\[\dm//g;
s/qq//g;
s/Recall//g;
s/\357//g;
s/[^0-9:a-zA-Z-\s,\"]/ /g;
s/\s+/ /g; # Extra spaces
}
return wantarray ? @out : $out[0];
}
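The function above is tuned to one particular application. A more general variant, shown here only as a sketch, removes escape sequences by their standard shape instead of listing them one by one. The name strip_ansi is made up for this example. Note that it throws away cursor positioning, so it cannot answer "what is at row 10, column 10"; for that, a terminal emulator such as Term::VT102 is still needed.

```perl
use strict;
use warnings;

# Remove ANSI/VT100 escape sequences from a string, keeping the text.
sub strip_ansi {
    my ($text) = @_;

    # CSI sequences: ESC [ , parameter bytes (0x30-0x3F),
    # intermediate bytes (0x20-0x2F), then one final byte (0x40-0x7E).
    # Covers cursor moves like ESC[10;10H and modes like ESC[?25l.
    $text =~ s/\e\[[\x30-\x3F]*[\x20-\x2F]*[\x40-\x7E]//g;

    # Remaining short sequences: ESC, optional intermediates, one final
    # byte. Covers ESC 7 / ESC 8 (save and restore cursor).
    $text =~ s/\e[\x20-\x2F]*[\x30-\x7E]//g;

    return $text;
}
```

For example, strip_ansi("\e[10;10Hfoo\e[0m") returns "foo".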

Convert Word doc or docx files into text files?

I need a way to convert .doc or .docx extensions to .txt without installing anything. I also don't want to have to manually open Word to do this obviously. As long as it's running on auto.
I was thinking that either Perl or VBA could do the trick, but I can't find anything online for either.
Any suggestions?
A simple Perl only solution for docx:
Use Archive::Zip to get the word/document.xml file from your docx file. (A docx is just a zipped archive.)
Use XML::LibXML to parse it.
Then use XML::LibXSLT to transform it into text or html format. Search the web to find a nice docx2txt.xsl file :)
Cheers !
J.
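The same idea can be sketched with core modules only. This version swaps Archive::Zip for IO::Uncompress::Unzip and replaces the XML::LibXML/XSLT step with plain regular expressions, so it is a rough approximation, not a real XML parse. The function name docx_to_text is invented for this example, and the result is returned as undecoded UTF-8 bytes.

```perl
use strict;
use warnings;
use IO::Uncompress::Unzip qw($UnzipError);

# Pull word/document.xml out of a .docx and reduce it to plain text.
sub docx_to_text {
    my ($docx_path) = @_;

    # Open just the one archive member that holds the document body.
    my $zip = IO::Uncompress::Unzip->new($docx_path, Name => 'word/document.xml')
        or die "Cannot read word/document.xml from $docx_path: $UnzipError";

    # Read the member in chunks; read() returns 0 at the end, < 0 on error.
    my ($xml, $buffer, $status) = ('');
    while (($status = $zip->read($buffer)) > 0) {
        $xml .= $buffer;
    }
    die "Read error in $docx_path: $UnzipError" if $status < 0;
    $zip->close;

    $xml =~ s{</w:p>}{\n}g;         # paragraph ends become newlines
    $xml =~ s{<w:tab\s*/>}{\t}g;    # tab elements become tabs
    $xml =~ s{<[^>]+>}{}g;          # drop every remaining tag

    # Decode the five predefined XML entities in a single pass.
    my %entity = (lt => '<', gt => '>', quot => '"', apos => "'", amp => '&');
    $xml =~ s/&(lt|gt|quot|apos|amp);/$entity{$1}/g;

    return $xml;
}
```

Calling print docx_to_text('some.docx'); would then write the text to standard output, one paragraph per line. Tables, footnotes and headers are not handled specially.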
Note that an excellent source of information for Microsoft Office applications is the Object Browser. You can access it via Tools → Macro → Visual Basic Editor. Once you are in the editor, hit F2 to browse the interfaces, methods, and properties provided by Microsoft Office applications.
Here is an example using Win32::OLE:
#!/usr/bin/perl
use strict;
use warnings;
use File::Spec::Functions qw( catfile );
use Win32::OLE;
use Win32::OLE::Const 'Microsoft Word';
$Win32::OLE::Warn = 3;
my $word = get_word();
$word->{Visible} = 0;
my $doc = $word->{Documents}->Open(catfile $ENV{TEMP}, 'test.docx');
$doc->SaveAs(
catfile($ENV{TEMP}, 'test.txt'),
wdFormatTextLineBreaks
);
$doc->Close(0);
sub get_word {
my $word;
eval {
$word = Win32::OLE->GetActiveObject('Word.Application');
};
die "$@\n" if $@;
unless(defined $word) {
$word = Win32::OLE->new('Word.Application', sub { $_[0]->Quit })
or die "Oops, cannot start Word: ",
Win32::OLE->LastError, "\n";
}
return $word;
}
__END__
For .doc, I've had some success with the linux command line tool antiword. It extracts the text from .doc very quickly, giving a good rendering of indentation. Then you can pipe that to a text file in bash.
For .docx, I've used the OOXML SDK as some other users mentioned. It is just a .NET library to make it easier to work with the OOXML that is zipped up in an OOXML file. There is a lot of metadata that you will want to discard if you are only interested in the text. Some other people have already written the code I see: DocXToText.
I have also found that Aspose.Words has a very simple API with great support.
There is also this bash command from commandlinefu.com which works by unzipping the .docx:
unzip -p some.docx word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'
I strongly recommend AsposeWords if you can do Java or .NET. It can convert, without Word installed, between all major text file types.
If you have some flavour of unix installed, you can use the 'strings' utility to find and extract all readable strings from the document. There will be some mess before and after the text you are looking for, but the results will be readable.
Note that you can also use OpenOffice to perform miscellaneous document, drawing, spreadsheet etc. conversions on both Windows and *nix platforms.
You can access OpenOffice programmatically (in a way analogous to COM on Windows) via UNO from a variety of languages for which a UNO binding exists, including from Perl via the OpenOffice::UNO module.
On the OpenOffice::UNO page you will also find a sample Perl scriptlet which opens a document, all you then need to do is export it to txt by using the document.storeToURL() method -- see a Python example which can be easily adapted to your Perl needs.
.doc's that use the WordprocessingML and .docx's XML format can have their XML parsed to retrieve the actual text of the document. You'll have to read their specifications to figure out which tags contain readable text.
The method of Sinan Ünür works well.
However, it crashed on some of the files I was converting.
Another method is to use Win32::OLE and Win32::Clipboard as such:
Open the Word document
Select all the text
Copy in the Clipboard
Print the content of Clipboard in a txt file
Empty the Clipboard and close the Word document
Based on the script given by Sigvald Refsu in http://computer-programming-forum.com/53-perl/c44063de8613483b.htm, I came up with the following script.
Note: I chose to save the txt file with the same basename as the .docx file and in the same folder but this can easily be changed
###########################################
use strict;
use File::Spec::Functions qw( catfile );
use FindBin '$Bin';
use Win32::OLE qw(in with);
use Win32::OLE::Const 'Microsoft Word';
use Win32::Clipboard;
my $monitor_word=0; #set 1 to watch MS Word being opened and closed
sub docx2txt {
##Note: the path shall be in the form "C:\dir\ with\ space\file.docx";
my $docx_file=shift;
#MS Word object
my $Word = Win32::OLE->new('Word.Application', 'Quit') or die "Couldn't run Word";
#Monitor what happens in MS Word
$Word->{Visible} = 1 if $monitor_word;
#Open file
my $Doc = $Word->Documents->Open($docx_file);
with ($Doc, ShowRevisions => 0); #Turn off revision marks
#Select the complete document
$Doc->Select();
my $Range = $Word->Selection();
with ($Range, ExtendMode => 1);
$Range->SelectAll();
#Copy selection to clipboard
$Range->Copy();
#Create txt file
my $txt_file=$docx_file;
$txt_file =~ s/\.docx$/.txt/;
open(TextFile,">$txt_file") or die "Error while trying to write in $txt_file ($!)";
printf TextFile ("%s\n", Win32::Clipboard::Get());
close TextFile;
#Empty the Clipboard (to prevent warning about "huge amount of data in clipboard")
Win32::Clipboard::Set("");
#Close Word file without saving
$Doc->Close({SaveChanges => wdDoNotSaveChanges});
# Disconnect OLE
undef $Word;
}
Hope this helps.
You can't do it in VBA if you don't want to start Word (or another Office application). Even if you meant VB, you'd still have to start a (hidden) instance of Word to do the processing.
I need a way to convert .doc or .docx extensions to .txt without installing anything
for I in *.doc *.docx; do mv "$I" "$(echo "$I" | sed -E 's/\.docx?$/.txt/')"; done
Just joking.
You could use antiword for the older versions of Word documents, and try to parse the xml of the new ones.
With docxtemplater, you can easily get the full text of a Word document (works with docx only).
Here's the code (Node.JS)
DocxTemplater=require('docxtemplater');
doc=new DocxTemplater().loadFromFile("input.docx");
result=doc.getFullText();
This is just three lines of code and doesn't depend on any word instance (all plain JS)