Perl and NLP, parse Names out of Biographies - perl

I'm pretty new to NLP in general, but getting really good at Perl, and I was wondering what kind of powerful NLP modules are out there. Basically, I have a file with a bunch of paragraphs, and some of them are people's biographies. So, first I need to look for a person's name, and that helps with the rest of the process later.
So I was roughly starting with something like this:
foreach $PPid (0 .. $PPscalar) {
    $paragraph = $PP[$PPid];
    if ($paragraph =~ /^(\w+ \w\. \w+|\w+ \w+)( also|)( has served| served| worked| joined| currently serves| has| was| is|, )/) {
        $possibleName = $1;
        $badName = 0;
        foreach $piece (@pieces) {
            if ($possibleName =~ /$piece/) {
                $badName = 1;
            }
        }
        if ($badName == 0) {
            push @namePile, $possibleName;
        }
    }
}
I anchor the match at the start of the paragraph because most of the names appear there, and then I look for keywords that denote action or possession. Right now, though, that picks up extra junk that is not a name. There has to be a module to do this, right?

Extracting names from data is hard. There are a variety of solutions. For named entity extraction you've got the following
The naive approach. I remember looking at this and being unimpressed with the output.
The dictionary approach. I've used this, but lots of false negatives, and I'm not too fond of the code underneath it.
An open source binary with a Perl interface (not recommended, even though I'm the author of this CPAN library; setting it up is fiddly too).
The best solution is the proprietary web service with the Net::Calais Perl wrapper.
Net::Calais is by far the best bet for speed and accuracy. Go with the Stanford library if you need the underlying implementation to be open source.

Have you tried searching CPAN?
http://search.cpan.org/search?query=NLP&mode=all
I also tried searching for "Natural Language" and found the following that you might be interested in:
Lingua::EN::Tagger
Also, if you must roll your own parser for NLP, check out Regexp::Grammars. This is the successor to Parse::RecDescent.

I don't know of any Perl modules which do processing of English in order to break it into parts of speech. I expect there are libraries out there which do that, in C or C++ or something, so if you don't find a good answer, maybe you can broaden your search.
One easy hack is to check for two words which are both capitalized:
if (/[A-Z][a-z]+\s+[A-Z][a-z]/) { ...
or check for titles:
if (/(?:Mr|Mrs|Ms|Dr)\.?\s+[A-Z][a-z]+/) { ...
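The two hacks above can be combined into one small routine. The following is only a sketch: the stop list, the sub name leading_name, and the sample paragraphs are made up for illustration, and a pattern like this will still miss many real names and accept some non-names.

```perl
use strict;
use warnings;

# Hypothetical stop list: capitalized words that often open a paragraph
# but are not part of a name.
my %stop = map { $_ => 1 } qw(The This That He She It In On At);

# Return a name candidate found at the start of a paragraph,
# or nothing (undef in scalar context) if there is none.
sub leading_name {
    my ($paragraph) = @_;

    # Optional title, a capitalized word, an optional middle initial
    # such as "Q.", and another capitalized word.
    return
        unless $paragraph =~ /^((?:(?:Mr|Mrs|Ms|Dr)\.?\s+)?[A-Z][a-z]+(?:\s+[A-Z]\.)?\s+[A-Z][a-z]+)\b/;

    my $candidate = $1;
    for my $word (split /\s+/, $candidate) {
        return if $stop{$word};    # reject "The Company" and similar
    }
    return $candidate;
}

for my $p ('John Q. Public has served on the board since 2001.',
           'The Company was founded in 1990.',
           'Dr. Jane Smith joined the firm last year.') {
    my $name = leading_name($p);
    print defined $name ? "$name\n" : "(no candidate)\n";
}
```

Keeping the rejected words in a hash makes it cheap to grow the list as new false positives show up in the data.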


Perl: How to retrieve album metadata from MusicBrainz?

I am creating a Perl script which will move a mp3 file to my music folder in format artist/album/mp3file. Now it is possible that some of my mp3 files don't have an album tag so I thought of querying the MusicBrainz database to retrieve album metadata given track title & artist.
I am using WebService::MusicBrainz Perl module for this task, but I am not able to see any method that gives album metadata info. My current code is:
use feature 'say';
use WebService::MusicBrainz::Track;
my $ws = WebService::MusicBrainz::Track->new();
my $response = $ws->search({ ARTIST => 'Ryan Adams', TITLE => 'when the stars go blue' });
my $track = $response->track();
print $track->title(), " - ", $track->artist()->name(), "\n";
say $track->id();
So, how do I get the album info for a given track using MusicBrainz, and if that is not possible, what are my alternatives?
First of all, what you want is to add metadata to mp3s, which is the most common use case. The "normal" way is to use a MusicBrainz tagger: open the files there and work with the interface to attach the correct metadata.
The suggested (GUI) tool is MusicBrainz Picard.
I also want to state that the Perl module is using the now deprecated Web Service Version 1 of MusicBrainz.
That Web Service has a couple of problems because it was made for another database scheme than the one used now at MusicBrainz.
However, the current Web Service Version 2 has only a Python library available: python-musicbrainzngs.
You can still work with the Perl module, but if you run into "weird" problems, this might be the reason.
This is how the Web Service works in general (and how it should apply directly for the Perl module as a wrapper for this web service):
Your search gives this:
http://musicbrainz.org/ws/1/track/?artist=%22Ryan%20Adams%22&title=%22when%20the%20stars%20go%20blue%22
There you get a list of recordings of this track. These recordings occur on multiple releases (ReleaseList).
You can disregard many of these, as they are of the type "compilation". You probably want the "album" releases.
You probably ask yourself why there are multiple album releases with the same name in the list.
This is because a "release" on MusicBrainz is a combination of a release-event and a couple of mediums.
You might have a US release, a German deluxe edition, and so on.
All of these releases are in one "release group".
You probably want the name of this "release group", which mostly is also the name of every release in this group.
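The selection described above can be sketched in plain Perl. This is only an illustration: the @releases list is made-up sample data rather than real web service output, the hash keys title and type and the type strings are assumptions, and likely_album_title is a hypothetical helper name.

```perl
use strict;
use warnings;

# Made-up sample data standing in for the releases of one track.
my @releases = (
    { title => 'Gold',             type => 'Album' },
    { title => 'Gold',             type => 'Album' },
    { title => 'Some Compilation', type => 'Compilation' },
);

# Keep only album-type releases and return the most common title,
# which usually matches the name of the release group.
sub likely_album_title {
    my (@candidates) = @_;
    my %count;
    $count{ $_->{title} }++
        for grep { lc($_->{type}) eq 'album' } @candidates;
    return unless %count;

    # Most frequent title first; ties are broken alphabetically.
    my ($best) = sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
    return $best;
}

my $album = likely_album_title(@releases);
print defined $album ? "$album\n" : "(no album release found)\n";
```

Tracks for which this returns nothing are the "problem" cases that can be set aside for manual tagging.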
You might want to read a bit on how the MusicBrainz Database is structured.
This is only the basic use case of course.
You might run into misspellings in artist/title, multiple or missing album release groups and other things.
However, altogether it should work and you can just drop the "problem" cases in a special directory and work with them in Picard.
Picard also has other means of identifying files through "musical analysis" (PUIDs, AcoustIDs).
EDIT:
my @tracklist = $response->track_list();
foreach my $track ( @tracklist ) {
    print $track->title(), " - ", $track->artist()->name(), "\n";
    my @releaselist = $track->release_list();
    foreach my $release ( @releaselist ) {
        print " ", $release->title(), " - ", $release->type();
    }
}
Should work in general, but it doesn't. It gives you all tracks of the response, but somehow it can't extract releases from release_list(). Possibly because the schema changed or because the perl module is broken.
Check out our perl modules for accessing the Cover Art Archive:
http://metacpan.org/pod/Net::CoverArtArchive
More info on our archive is here, including specs:
http://coverartarchive.org/
Good luck!

[Zend]Filtering variables in a huge project

I have a huge application written in Zend Framework. Until recently everything was fine.
It has now been redesigned and has gained a lot of new features and options, and I have to protect it against XSS.
Variables come from several sources (web forms, web services, an API, etc.); some of them should be escaped, some not.
What would be the best way to protect my website without editing all 2,000+ files and escaping every echo?
Zend Framework comes with a class called "Zend_Filter". This class has a "StripTags" filter option that will strip all tags from a given string.
http://framework.zend.com/manual/en/zend.filter.set.html#zend.filter.set.striptags
Note, however, that even the StripTags filter isn't recommended for sanitizing input if you allow some tags through, and it shouldn't be relied on to defend against XSS attacks. The manual recommends using Tidy or HTML Purifier instead.
http://tidy.sourceforge.net/
http://htmlpurifier.org/
I think HTML Purifier is pretty easy to use. From their docs website:
require_once '/path/to/HTMLPurifier.auto.php';
$config = HTMLPurifier_Config::createDefault();
$purifier = new HTMLPurifier($config);
$clean_html = $purifier->purify($dirty_html);
I hope that helps!
Cheers!

Selenium WebDriver with Perl

I am trying to run the Selenium driver with Perl bindings, and due to the lack of examples and documentation, I am running into some roadblocks. I have figured out how to do some basic things, but I am having trouble with other simple things, like validating the text on a page using the Selenium::Remote::Driver package.
If I try to do something like this:
$sel->get("https://www.yahoo.com/" );
$ret = $sel->find_element("//div[contains( text(),'Thursday, April 26, 2012')]");
I get a message back that the element couldn't be found. I am using xpath because the driver package doesn't appear to have a sub specific for finding text.. at least not that I've found.
If my xpath setup is wrong or if someone knows a better way, that would be extremely helpful. I'm having problems with some button clicking too.. but this seems like it should be easier and is bugging me.
Finding text on a web page and comparing that text to some "known good value" using Selenium::Remote::Driver can be implemented as follows:
File: SomeWebApp.pm
package SomeWebApp;
sub get_text_present {
    my $self = shift;
    my $target = shift;
    my $locator = shift;
    my $text = $self->{driver}->find_element($target, $locator)->get_text();
    return $text;
}
Somewhere in your test script: test.pl
my $text = $some_web_app->get_text_present("MainContent_RequiredFieldValidator6", "id");
The above finds the element identified by $target using the locating scheme identified by $locator and stores that element's text in the variable $text. You can then compare or validate it as needed.
https is a tad slower loading than http. Although WebDriver is pretty good about waiting until it's figured out that the requested page is fully loaded, maybe you need to give it a little help here. Add a sleep(2); after the get() call and see if it works. If it does, try cutting down to 1 second. You can also do a get_title call to see if you've loaded the page you think you have.
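Instead of tuning a fixed sleep by hand, the lookup can be retried until it succeeds or a deadline passes. The helper below is a minimal sketch: wait_for is a hypothetical name, not part of Selenium::Remote::Driver, and the demo uses a stand-in check so that it runs without a browser.

```perl
use strict;
use warnings;

# Call $check repeatedly until it returns a true value or $timeout
# seconds have passed. Exceptions thrown by $check count as "not yet".
sub wait_for {
    my ($check, $timeout, $interval) = @_;
    my $deadline = time() + $timeout;
    while (1) {
        my $result = eval { $check->() };
        return $result if $result;
        return if time() >= $deadline;
        sleep $interval;
    }
}

# Stand-in check that succeeds on the third attempt.
my $attempts = 0;
my $found = wait_for(sub { ++$attempts >= 3 ? 'element' : 0 }, 5, 0);
print defined $found ? "found after $attempts attempts\n" : "timed out\n";
```

In a real test the code reference would wrap the lookup from the question, for example sub { $sel->find_element($xpath) }, assuming $sel and $xpath are set up as in the question. The eval is there because a failed lookup may raise an exception rather than return false.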
The other possibility is that your text target isn't quite exactly the same as what's on the page. You could try looking first for one word, such as "April", and see if you get a hit, and then expand until you find the mismatch (e.g., does this string actually have a newline or break within it? How about an HTML entity such as a non-breaking space?). Also, you are looking for that bit of text anywhere under a div (all child text supposedly is concatenated, and then the search done). That would more likely cast too wide a net than not get anything at all, but it's worth knowing.
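One way to rule out invisible differences such as line breaks or non-breaking spaces is to normalize whitespace on both sides before comparing. The sketch below is plain Perl and assumes the page text has already been fetched into a string (for example with get_text, as in the earlier answer); normalize_text and the sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Collapse runs of whitespace, including the non-breaking space
# (U+00A0), into single spaces and trim both ends.
sub normalize_text {
    my ($text) = @_;
    $text =~ s/[\s\x{00A0}]+/ /g;
    $text =~ s/^ | $//g;
    return $text;
}

my $expected = 'Thursday, April 26, 2012';
my $on_page  = "Thursday,\x{00A0}April\n  26, 2012 ";

print normalize_text($on_page) eq $expected ? "match\n" : "mismatch\n";
```

If the normalized strings match but the XPath lookup still fails, the mismatch is likely in the whitespace itself, and locating the element by id or class and comparing its text in Perl may be more robust than matching the text inside the XPath.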

Erroring in my Perl script coming from CAM::PDF::Annot module. Don't know why

I believe this may be a bug in the module I am using, or I am just completely overlooking something.
My code is this:
#!/usr/bin/perl
use strict;
use warnings;
use CAM::PDF;
use CAM::PDF::Annot;
sub main()
{
    my $pdf = CAM::PDF::Annot->new( 'b.pdf' );
    my $otherDoc = CAM::PDF::Annot->new( 'b_an.pdf' );
    my $page = 1;
    my %refs;
    my @list = @{$pdf->getAnnotations($page)};
    for my $annotRef (@list){
        $otherDoc->appendAnnotation( $page, $pdf, $annotRef, \%refs);
    }
    $otherDoc->output('pdf_merged.pdf');
}
exit main;
This code was taken almost directly from the synopsis found on the module's CPAN page: http://metacpan.org/pod/CAM::PDF::Annot
The problem occurs when I run the script on two PDFs that both have annotations. Two PDFs without annotations work. One PDF with annotations and one without also works. It fails only when both PDFs have annotations.
The error is: "Can't use string ("46") as an ARRAY ref while "strict refs" in use at /usr/opt/perl5/lib/site_perl/5.10.1/CAM/PDF/Annot.pm line 195"
Line 195 of Annot.pm is:
push @{$annots->{value}}, $pupRef;
Annot.pm is inside the CAM::PDF::Annot module.
Any guidance in fixing this would be greatly appreciated!
P.S. In the error, "string ("x")", x is always a number, and seems to change depending on the pdf and the annotations within the pdf.
And I will try to add any other information that you need to help figure this out!
Whenever I have a problem with a CPAN module, I go to its webpage to try and assess its quality and see if any bugs have already been reported.
http://search.cpan.org/~donatoaz/CAM-PDF-Annot-0.06 shows the following suspicious results:
CPAN Testers PASS (2) FAIL (168) NA (49)
It is surprising that you were able to install the module. No one has reported bugs, but there is clearly a major problem with the code. It seems the author is either unaware of the tester reports (which have been sent to his CPAN email address for more than a year), or has stopped maintaining it.
You could submit a bug report, so at least others will be aware of your issue.
I realize this does not answer your question of how to fix the problem, but even if you do identify a fix, the author may not apply it (in which case, someone could start the process of becoming a co-maintainer).

How can I read the URL-Data sent with POST in Perl?

I'm trying to read out the POST-Data that was sent from a form in a page to my Perl Script. I googled and found out that:
read(STDIN, $param_string, $ENV{'CONTENT_LENGTH'})
reads the whole data string and writes it to $param_string in the form of
Param1=Value1&Param2=Value2&Param3=Value3
By splitting it at the right places I get the necessary data.
But I wonder why my $param_string is empty.
When I try the whole thing with GET:
$param_string = $ENV{'QUERY_STRING'};
everything works fine. Does anybody have an idea?
There is absolutely no real reason for someone at your level to want to hand-parse CGI requests.
Please use CGI::Simple or CGI.pm.
CGI.pm has a lot of baggage (HTML generation, function oriented interface) which makes CGI::Simple preferable.
Using any CGI processing module on CPAN is better than trying to write CGI processing code from scratch.
See parse_query_string in CGI::Simple for a way of accessing parameters passed using the query string when processing a form that is POSTed to your script.
If you want to learn how to do it right, you can read the source code of either module. Reading through the CGI.pm CHANGES file is also instructive.
If you are able to retrieve GET data but not POST data, most likely you forgot to change the form's method to post. You can check the submit method with this condition in an if statement:
if ($ENV{'REQUEST_METHOD'} eq "POST") {
    read(STDIN, $param_string, $ENV{'CONTENT_LENGTH'});
} else {
    $param_string = $ENV{'QUERY_STRING'};
}
Under mod_perl 2, Apache2::Request works for me.