What's the best way to write a maintainable web scraping app? - perl

I wrote a Perl script a while ago which logged into my online banking and emailed me my balance and a mini-statement every day. I found it very useful for keeping track of my finances. The only problem is that I wrote it using just Perl and curl, and it was quite complicated and hard to maintain. After a few instances of my bank changing their webpage I got fed up with debugging it to keep it up to date.
So what's the best way of writing such a program in such a way that it's easy to maintain? I'd like to write a nice well engineered version in either Perl or Java which will be easy to update when the bank inevitably fiddle with their web site.

In Perl, something like WWW::Mechanize can already make your script simpler and more robust, because it can find HTML forms in previous responses from the website. You can fill in these forms to prepare a new request. For example:
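    use WWW::Mechanize;

    # $url and $password are assumed to be set earlier in the script.
    my $mech = WWW::Mechanize->new();
    $mech->get($url);

    # Fill in and submit the first form found on the fetched page.
    $mech->submit_form(
        form_number => 1,
        fields      => { password => $password },
    );
    die "Form submission failed\n" unless $mech->success;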

WWW::Mechanize and Web::Scraper, used in combination, are the two tools that make me most productive. There's a nice article about that combination at catalyzed.org.

If I were to give you one piece of advice, it would be to use XPath for all your scraping needs. Avoid regexes.

Hmm, just found Finance::Bank::Natwest, which is a Perl module specifically for my bank! Wasn't expecting it to be quite that easy.

A lot of banks publish their data in a standard format, which is commonly used by personal finance packages such as MS Money or Quicken to download transaction information. You could look for that hook and download using the same API, and then parse the data on your end (e.g. parse Excel documents with Spreadsheet::ParseExcel, and Quicken docs with Finance::QIF).
Edit (reply to comment): Have you considered contacting your bank and asking them how you can programmatically log into your account in order to download the financial data? Many/most banks have an API for this (which Quicken etc make use of, as described above).
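To give a feel for what parsing a downloaded statement involves, here is a minimal hand-rolled sketch of reading QIF data using only core Perl. It is not the Finance::QIF API; the sub name parse_qif and the sample transactions are made up for illustration. It assumes the common QIF layout in which each line starts with a one-letter field code (D for date, T for amount, P for payee, M for memo) and a line starting with "^" ends the record.

```perl
use strict;
use warnings;

# Parse QIF text into a list of transaction hashes.
# parse_qif is a hypothetical helper, not part of Finance::QIF.
sub parse_qif {
    my ($text) = @_;
    my %field_for = (D => 'date', T => 'amount', P => 'payee', M => 'memo');
    my (@transactions, %current);

    for my $line (split /\r?\n/, $text) {
        next if $line =~ /^\s*$/;    # skip blank lines
        next if $line =~ /^!/;       # skip headers such as "!Type:Bank"
        if ($line =~ /^\^/) {        # "^" marks the end of a record
            push @transactions, {%current} if %current;
            %current = ();
            next;
        }
        my ($code, $value) = $line =~ /^(.)(.*)$/;
        my $name = $field_for{$code} or next;    # ignore unmapped codes
        $value =~ tr/,//d if $name eq 'amount';  # drop thousands separators
        $current{$name} = $value;
    }
    return @transactions;
}

# Made-up sample data for illustration only.
my $sample = <<'QIF';
!Type:Bank
D01/02/2009
T-42.50
PCorner Shop
^
D03/02/2009
T1,200.00
PSalary
^
QIF

for my $t (parse_qif($sample)) {
    printf "%s  %10.2f  %s\n", $t->{date}, $t->{amount}, $t->{payee};
}
```

Because the statement arrives as structured data rather than HTML, a parser like this keeps working when the bank redesigns its web pages, which is the main maintenance win over scraping.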

There's a currently up to date Ruby implementation here:
http://github.com/warm/NatWoogle

Use Perl and the Web::Scraper package.

Related

Controlling a random webpage through a Perl script

There is a random website, say abc.com, and this website has a search engine. Is it possible to create a Perl script that automatically reads from a text file, feeds search values into this search engine, and downloads the files that result from the search? Once a download is complete, the loop has to continue until all the search values have been exhausted. I don't have any server details about the website itself.
Any help is much appreciated. Thanks!
This is HTTP client programming. You're basically writing a program that is pretending to be a browser.
The standard module for doing this is probably WWW::Mechanize (see the cookbook and the examples).
If you want something lower level, then the LWP bundle of modules will do all that you want.
There's a free online book. But it's a little old and probably doesn't reflect current best practices.

How to hold or save the DTMF input in VXML? Any guides to set up a test IVR (VXML) service?

So I currently have an IVR written in some dodgy old code which is confusing and goes way over the top for some things.
I'm wanting to re-write one of my basic IVRs with VXML.
A little bit of research suggests that I can call Perl scripts, which I can use to run data past databases, so that part isn't too bad.
My question is: what is the syntax to "hold" or save the DTMF input for a menu, and then pass it to the Perl script?
Question two.
Hosting of the VXML IVR. Are there any guides to setting up a test service? I have a PABX, and a few servers I can play around with.
To play around with VoiceXML I would recommend Voxeo's excellent platform called Prophecy. You can get two ports for free that you can run on a server or even on your workstation/laptop. They provide a SIP softphone to test your apps, so it does not require any elaborate setup; just a simple install and you are ready to go. They also have a hosted environment that you can test from for free. You just pay for the service if you put it into production. Here is a post that describes how to set up and test applications in the hosted environment. And here is a post on how to set up and test applications if you install Prophecy on your PC. Voxeo's CTO is on the VoiceXML standards committee, so their platform conforms very closely to the standard.
Voxeo's developer site has excellent documentation on VoiceXML that is full of examples. On your question of how to get DTMF input, you can go to the bottom of the left pane in the documentation and click on the element "field". The field element is used to collect information from the caller. To easily do this with DTMF input you can use the builtin grammars. For more information on builtin grammars, look at the documentation on the "type" attribute of the "field" element. Once you get a "filled" event from the "field" you can call your Perl script using a "submit" element. Voxeo's documentation has a link to this article on creating VoiceXML applications with Perl. The Voxeo Forum is also an excellent source of information on VoiceXML and Prophecy. If you cannot find an answer to your question in the Forum, just ask it and their knowledgeable support staff will assist.
If you are also familiar with .NET technologies there is an open source project called VoiceModel that makes it easy to develop VoiceXML applications using ASP.NET. The project has a lot of examples in it.
These resources should get you started with VoiceXML fairly quickly.
To specifically answer your DTMF question, just use <submit> to send the DTMF input to the perl script, using the attribute namelist (which is just a list of variables that you need to send).
Also, from the VXML 2.0 specification:
"The <submit> element is used to submit information to the origin Web server and then transition to the document sent back in the response. Unlike <goto>, it lets you submit a list of variables to the document server via an HTTP GET or POST request. For example, to submit a set of form items to the server you might have:
<submit next="log_request" method="post"
namelist="name rank serial_number"
fetchtimeout="100s" fetchaudio="audio/brahms2.wav"/>
"
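Putting the <field>, <filled> and <submit> pieces together, a rough sketch of the Perl side might look like the script below. Everything specific in it is an assumption made for illustration: the variable name "choice", the script name handler.pl, the prompts, and the use of a plain CGI environment with a GET query string. It uses core Perl only. With no input it serves a menu whose <field> stores the key press in "choice"; when the voice browser submits that variable back, it replies with a second document.

```perl
use strict;
use warnings;

# Pull one parameter out of a URL-encoded query string (core Perl only).
sub query_param {
    my ($query, $name) = @_;
    for my $pair (split /[&;]/, $query // '') {
        my ($key, $value) = split /=/, $pair, 2;
        next unless defined $key && $key eq $name;
        $value //= '';
        $value =~ tr/+/ /;
        $value =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
        return $value;
    }
    return;
}

# Menu document: the <field> holds the DTMF input in the variable "choice",
# and <submit> sends it back to this script once the field is filled.
sub vxml_menu {
    my ($self_url) = @_;
    return <<"VXML";
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="menu">
    <field name="choice" type="digits?length=1">
      <prompt>Press 1 for sales or 2 for support.</prompt>
      <filled>
        <submit next="$self_url" method="get" namelist="choice"/>
      </filled>
    </field>
  </form>
</vxml>
VXML
}

# Reply document built from the submitted digits.
sub vxml_reply {
    my ($choice) = @_;
    # Only echo digits, so caller input never ends up as markup.
    my $digits = (defined($choice) && $choice =~ /^([0-9]+)$/) ? $1 : '';
    my $prompt = length($digits) ? "You entered $digits." : "No input received.";
    return <<"VXML";
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form>
    <block>
      <prompt>$prompt</prompt>
    </block>
  </form>
</vxml>
VXML
}

# CGI entry point; "handler.pl" is a hypothetical name for this script.
my $choice = query_param($ENV{QUERY_STRING}, 'choice');
print "Content-Type: application/voicexml+xml\r\n\r\n";
print defined($choice) ? vxml_reply($choice) : vxml_menu('handler.pl');
```

In a real IVR the reply branch is where the database lookup would go; the digits arrive as an ordinary HTTP parameter, so nothing about that part is VoiceXML-specific.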

Existing app that extracts meaningful data from old e-mails?

I was wondering if there is an application, and if not if it's worth writing one, that can gather meaningful data from old e-mails. I'm thinking things like:
Instructions (that could become "5 steps to..." posts)
Definitions
etc
Any idea? Suggestions? etc?
Well, I can offer the same solution as I did to this post, that is, software like TexLexan or Alchemy API that can find keywords and other summary information. There is also a good list of open source and commercial solutions on this page. It is definitely easier to see if one of those works than to write your own.

How to integrate vBulletin features into an external site

I have a web site I'm building and the client wants to have features from vBulletin (blog, forums) integrated into the site. It's not enough to simply add the site's skin to vBulletin. Is there a way to do this?
I would expect there to be documentation on how, if it is possible, to do such a thing but haven't been able to find anything.
I'd rather not connect and query the vBulletin database directly.
There is no proper API for this yet, so you'd either have to rely on things like RSS, or query the database directly. RSS won't get you old data, nor any forum structures, etc. just basics of new data.
After much research (see: cursing) I've found that external.php and blog_external.php do what I want though not quite as elegantly as I would like.
So if you want to incorporate forum threads into your web page then external.php is what you need. It appears to be a bit more customizable in that you can have it output in JavaScript, XML, RSS, and RSS Enclosure (podcasting).
If you want to incorporate blog posts you appear to be limited to RSS only. Like I said, less than ideal, but at least it's something.
There is more information here: http://www.vbulletin.com/docs/html/vboptions_group_external

Best method to write an email poller

I am working on an email polling solution for a multi-user system. Users send emails to their respective IDs, and the messages are polled and inserted into a DB.
There are two options that I am considering:
Perl/Unix based email pollers.
A Java based poller.
What would you recommend? (Other suggestions are also welcome.)
Instead of polling, why don't you forward the mail to a process? Depending on the mail server you use, you can do that as an alias or even in the .forward file.
I've nothing much to add to this, but there's currently a project at Google Code to rebuild iwantsandy.com as open source.
It's at:
http://code.google.com/p/sandysback/
I'm definitely going to be watching this to see how they parse emails, and have those emails "inserted into a db"
Whichever language you have most experience in!
I personally know Java and Perl well, and for this task I would choose Perl, but the differences are marginal.
Perl would be shorter and sweeter; Java would take longer but would probably be a more robust solution once the database access is sorted out.
I find Perl DBI a better and more portable database interface than JDBC, which does not hide database implementations from your code and is sensitive to version changes etc., i.e. you must have the right version of the right database driver for your target database.
RE: Polling
If you have the option to forward the email to a process, I would highly recommend you do that. (Forwarding generally puts less load on the server than polling does.) If not, then polling is the next best thing. Look into the POP3 client libraries for whichever language you are most comfortable with.
RE: Language choice
If I intended to do a lot of parsing of the emails then Perl would be my choice. If not much parsing is involved then Java would be the way to go for me ;-).
-- In a former life I wrote a Perl script to parse (well structured) incoming emails into HTML pages and post them to a web server.
You have a couple of options. As the original poster said, probably the simplest way is to set up an entry in the aliases file that points to a script.
Then the email gets passed as standard input to the script. You can then use a Perl script plus MIME modules to parse out the bits of the message and do whatever you want with it.
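As a rough illustration of the script end of that pipe, the sketch below splits a message into headers and body using only core Perl. The sub name parse_message and the sample message are made up for this example, and it deliberately does no MIME decoding; for real mail with attachments or encoded parts, a proper module such as Email::MIME or MIME::Parser would be the better choice.

```perl
use strict;
use warnings;

# Read one RFC 822-style message from a filehandle and return
# (\%headers, $body). Header names are lower-cased, folded
# continuation lines are joined onto the previous header, and a
# repeated header keeps only its last value (fine for a sketch).
sub parse_message {
    my ($fh) = @_;
    my (%headers, $last);

    while (defined(my $line = <$fh>)) {
        $line =~ s/\r?\n\z//;
        last if $line eq '';                 # blank line ends the headers
        if ($line =~ /^\s+(.*)/ && defined $last) {
            $headers{$last} .= " $1";        # unfold a continuation line
        }
        elsif ($line =~ /^([^:\s]+):\s*(.*)/) {
            $last = lc $1;
            $headers{$last} = $2;
        }
    }
    my $body = do { local $/; <$fh> } // '';  # slurp the rest as the body
    return (\%headers, $body);
}

# Demo on an in-memory message; when run from an alias, pass \*STDIN instead.
my $raw = "From: alice\@example.com\nSubject: Weekly\n report\n\nHello there.\n";
open my $in, '<', \$raw or die "Cannot open in-memory handle: $!";
my ($headers, $body) = parse_message($in);
print "From:    $headers->{from}\n";
print "Subject: $headers->{subject}\n";
print "Body:    $body";
```

With sendmail-style aliases, the hook is typically a line of the form poller: "|/path/to/script.pl" (the alias name and path here are placeholders). From the returned hash and body, inserting a row into the database is an ordinary DBI call.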
One might also look at Apache James, which is a customizable mail server. It has the equivalent of servlets, called 'mailets', that you put your business logic in. They are often hard to deploy in enterprise scenarios though, as most companies don't like having custom mail servers deployed.
... the aliases route is probably your best bet. One other note of caution: email delivery isn't guaranteed. If you are using this as some sort of app-to-app messaging system, and you control both ends, you should probably look at something else, like JMS-type messaging.
-Ace