What does this Lucene-related code actually do? - perl

#!/usr/bin/perl
use Plucene::Document;
use Plucene::Document::Field;
use Plucene::Index::Writer;
use Plucene::Analysis::SimpleAnalyzer;
use Plucene::Search::HitCollector;
use Plucene::Search::IndexSearcher;
use Plucene::QueryParser;
my $content = "I am the law";
my $doc = Plucene::Document->new;
$doc->add(Plucene::Document::Field->Text(content => $content));
$doc->add(Plucene::Document::Field->Text(author => "Philip Johnson"));
my $analyzer = Plucene::Analysis::SimpleAnalyzer->new();
my $writer = Plucene::Index::Writer->new("my_index", $analyzer, 1);
$writer->add_document($doc);
undef $writer; # close
my $searcher = Plucene::Search::IndexSearcher->new("my_index");
my @docs;
my $hc = Plucene::Search::HitCollector->new(collect => sub {
    my ($self, $doc, $score) = @_;
    push @docs, $searcher->doc($doc);
});
$searcher->search_hc($query => $hc);
Try as I may, I don't understand what this code does. I understand the familiar Perl syntax and what's going on on that end...but what is a Lucene Document, Index::Writer - etc.? Most importantly, when I run this code I expect something to be generated...yet I see nothing.
I know what an Analyzer is...thanks to this doc linked to in CPAN: http://onjava.com/pub/a/onjava/2003/01/15/lucene.html?page=2. But I am just not getting why I run this code and it doesn't seem to DO anything...

Lucene is a search engine designed to search huge amounts of text very fast.
My Perl is not strong, but here is what I understand of the Lucene objects:
my $content = "I am the law";
my $doc = Plucene::Document->new;
$doc->add(Plucene::Document::Field->Text(content => $content));
$doc->add(Plucene::Document::Field->Text(author => "Philip Johnson"));
This part creates a new document object and adds two text fields to it, content and author, in preparation for adding it to a Lucene index as searchable data.
my $analyzer = Plucene::Analysis::SimpleAnalyzer->new();
my $writer = Plucene::Index::Writer->new("my_index", $analyzer, 1);
$writer->add_document($doc);
undef $writer; # close
This part creates the index files and adds the previously created document to that index. At this point, you should have a "my_index" folder in your application directory containing several index files that hold the document's data as searchable text.
my $searcher = Plucene::Search::IndexSearcher->new("my_index");
my @docs;
my $hc = Plucene::Search::HitCollector->new(collect => sub {
    my ($self, $doc, $score) = @_;
    push @docs, $searcher->doc($doc);
});
$searcher->search_hc($query => $hc);
This part attempts to search the index created above for the same document data you just used to create it. Presumably, you'll have your search results in @docs at this point, which you might want to display to the user (though this sample does not). Note also that the snippet never assigns $query, so as shown there is no query to search with. Nothing is printed either, which is why running the code produces no visible output.
This seems to be a "hello world" application for Lucene usage in Perl. In real-life applications, I don't see a scenario where you would create the index and then search it from the same piece of code.
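To make the idea of an index more concrete, here is a minimal sketch of a toy inverted index in plain Perl. It is not how Plucene works internally; the function names build_index and search_index are made up for this illustration, and the word splitting only roughly mimics what a simple analyzer does.

```perl
use strict;
use warnings;

# Toy inverted index: maps each lowercased word to the ids of the
# documents that contain it. Illustration only, not Plucene's internals.
sub build_index {
    my (%docs) = @_;          # id => text
    my %index;
    for my $id (sort keys %docs) {
        # Split on non-letters, roughly what a simple analyzer does
        for my $word (grep { length } split /[^a-z]+/, lc $docs{$id}) {
            $index{$word}{$id} = 1;
        }
    }
    return \%index;
}

# Return the sorted ids of the documents containing the given word.
sub search_index {
    my ($index, $word) = @_;
    my @ids = keys %{ $index->{ lc($word) } || {} };
    return sort @ids;
}

my $index = build_index(
    1 => "I am the law",
    2 => "The law is an ass",
);
print join(",", search_index($index, "law")), "\n";   # both documents
print join(",", search_index($index, "ass")), "\n";   # only document 2
```

The point is that indexing happens once, when documents are added, and a search is then a cheap lookup. Nothing appears on screen unless the program prints the hits itself.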

Where did you get this code from? It is a copy of the code in the Synopsis at the start of the Plucene POD documentation.
I guess it was an attempt by someone to begin learning about Plucene. The code in a module's synopsis isn't necessarily meant to achieve something useful on its own.
As the documentation you refer to says, Lucene is a Java library that adds text indexing and searching capabilities to an application. It is not a complete application that one can just download, install, and run.
Where did you get the idea that you should run the code you show?

Related

In Perl, can I dynamically add methods to only one object of a package?

I'm working with WWW::Mechanize to automate web-based back office clicking I need to do to get my test e-commerce orders into the state I need them to be to test changes I have made to a particular part of a long, multi-part workflow. To process a lot of orders in a batch, I need to click the Home link often. To make that shorter, I hacked a method into WWW::Mechanize at run time like this (based on an example in Mastering Perl by brian d foy):
{ # Shortcut to go back to the home page by calling $mech->go_home
# I know I'll get a warning and do not want it!
no warnings 'once';
my $homeLink = $mech->find_link( text => 'Home' )->url_abs();
$homeLink =~ s/system=0/system=1/;
*WWW::Mechanize::go_home = sub {
    my ($self) = @_;
return $self->get($homeLink);
};
}
This works great, and does not hurt anyone because the script I'm using it in is only used by me and is not part of the larger system.
But now I wonder if it is possible to tell only one $mech object that it has this method, while another WWW::Mechanize object that might be created later (to, say, do some cross-referencing without mixing up the other one that has an active session to my back office) cannot use that method.
I'm not sure if that is possible at all, since, if I understand the way objects work in Perl, the -> operator tells it to look for the subroutine go_home inside the package WWW::Mechanize and pass the $mech as the first argument to it. Please correct me if this understanding is wrong.
I've experimented by adding a sort of hard-coded check that only lets the original $mech object use the function.
my $onlyThisMechMayAccessThisMethod = "$mech";
my $homeLink = $mech->find_link( text => 'Home' )->url_abs();
$homeLink =~ s/system=0/system=1/;
*WWW::Mechanize::go_home = sub {
    my ($self) = @_;
return undef unless $self eq $onlyThisMechMayAccessThisMethod;
return $self->get($homeLink);
};
Since "$mech" contains the address of where the data is stored (e.g. WWW::Mechanize=HASH(0x2fa25e8)), another object will look different when stringified this way.
I am not convinced however that this is the way to go. So my question is: Is there a better way to only let one object of the WWW::Mechanize class have this method? I'm also glad about other suggestions regarding this code.
This is just
$mech->follow_link(text => 'Home')
and I don't think it's special enough to warrant a method of its own, or to need restricting to an exclusive club of objects.
It's also worth noting that there is no need to mess with typeglobs to declare a subroutine in a different package. You just have to write, for example
sub WWW::Mechanize::go_home {
    my ($self) = @_;
    return $self->get($homeLink);   # $homeLink must be in scope here
}
But the general solution is to subclass WWW::Mechanize and declare as members only those objects you want to have the new method.
File MyMechanize.pm
package MyMechanize;
use strict;
use warnings;
use parent 'WWW::Mechanize';
sub go_home {
my $self = shift;
my $homeLink = $self->find_link(text => 'Home')->url_abs;
$homeLink =~ s/system=0/system=1/;
return $self->get($homeLink);
}
1;
File test.pl
use strict;
use warnings;
use MyMechanize;
my $mech = MyMechanize->new;
$mech->get('http://mydomain.com/path/to/site/page.html');
$mech->go_home;
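If the method really must belong to a single, already-constructed object, one possible variation on the subclassing idea is to rebless just that object into a one-off subclass created at run time. The sketch below is only an illustration under stated assumptions: Browser is a hypothetical stand-in for a class such as WWW::Mechanize, and add_method_to_object is a made-up helper, not part of any module.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed refaddr);

# Hypothetical stand-in for a class such as WWW::Mechanize.
{
    package Browser;
    sub new { my ($class) = @_; return bless {}, $class }
    sub get { my ($self, $url) = @_; return "GET $url" }
}

# Give exactly one object an extra method by reblessing it into a
# one-off subclass of its current class.
sub add_method_to_object {
    my ($obj, $name, $code) = @_;
    my $class     = blessed($obj);
    my $singleton = sprintf '%s::_Singleton_%x', $class, refaddr($obj);
    no strict 'refs';
    @{"${singleton}::ISA"} = ($class);      # inherit everything else
    *{"${singleton}::${name}"} = $code;     # install the new method
    bless $obj, $singleton;
    return $obj;
}

my $mech  = Browser->new;
my $other = Browser->new;
add_method_to_object($mech, go_home => sub { $_[0]->get('/home') });

print $mech->go_home, "\n";                          # GET /home
print $other->can('go_home') ? "yes" : "no", "\n";   # no
```

Because the one-off class inherits from the original, the object still passes isa checks and keeps all its inherited methods, while other objects of the class are unaffected.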

Why is it a bad idea to write configuration data in code?

Real-life case (from caff) to exemplify the short question subject:
$CONFIG{'owner'} = q{Peter Palfrader};
$CONFIG{'email'} = q{peter@palfrader.org};
$CONFIG{'keyid'} = [ qw{DE7AAF6E94C09C7F 62AF4031C82E0039} ];
$CONFIG{'keyserver'} = 'wwwkeys.de.pgp.net';
$CONFIG{'mailer-send'} = [ 'testfile' ];
Then in the code: eval `cat $config`, access %CONFIG
Provide answers that lay out the general problems, not only specific to the example.
There are many reasons to avoid configuration in code, and I go through some of them in the configuration chapter in Mastering Perl.
No configuration change should carry the risk of breaking the program. It certainly shouldn't carry the risk of breaking the compilation stage.
People shouldn't have to edit the source to get a different configuration.
People should be able to share the same application without sharing a common group of settings, and without re-installing the application just to change the configuration.
People should be allowed to create several different configurations and run them in batches without having to edit the source.
You should be able to test your application under different settings without changing the code.
People shouldn't have to learn how to program to be able to use your tool.
You should only loosely tie your configuration data structures to the source of the information to make later architectural changes easier.
You really want an interface instead of direct access at the application level.
I sum this up in my Mastering Perl class by telling people that the first rule of programming is to create a situation where you do less work and people leave you alone. When you put configuration in code, you spend more time dealing with installation issues and responding to breakages. Unless you like that sort of thing, give people a way to change the settings without causing you more work.
$CONFIG{'unhappy_employee'} = `rm -rf /`
One major issue with this approach is that your config is not very portable. If a functionally identical tool were built in Java, loading configuration would have to be redone. If both the Perl and the Java variation used a simple key=value layout such as:
owner = "Peter Palfrader"
email = "peter@palfrader.org"
...
they could share the config.
Also, calling eval on the config file seems to open this system up to attack. What could a malicious person add to this config file if they wanted to wreak some havoc? Do you realize that ANY arbitrary code in your config file will be executed?
Another issue is that it's highly counter-intuitive (at least to me). I would expect a config file to be read by some config loader, not executed as a runnable piece of code. This isn't so serious but could confuse new developers who aren't used to it.
Finally, while it's highly unlikely that the implementation of constructs like q{...} will ever change, if it did change, the config file might stop working.
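As a rough sketch of the "config loader" idea, the following reads simple key = value text into a hash without ever executing it. The exact format (comment lines starting with #, optional double quotes around values) and the name parse_config are assumptions made for this example, not something caff or any particular module defines.

```perl
use strict;
use warnings;

# Parse simple "key = value" text into a hash without executing it.
# Blank lines and lines starting with '#' are skipped; surrounding
# double quotes on a value are removed. The format is an assumption.
sub parse_config {
    my ($text) = @_;
    my %config;
    for my $line (split /\n/, $text) {
        next if $line =~ /^\s*(?:#.*)?$/;
        my ($key, $value) = $line =~ /^\s*([\w-]+)\s*=\s*(.*?)\s*$/
            or die "Malformed config line: $line\n";
        $value =~ s/^"(.*)"$/$1/;
        $config{$key} = $value;
    }
    return \%config;
}

my $config = parse_config(<<'END');
# identity settings
owner = "Peter Palfrader"
keyserver = wwwkeys.de.pgp.net
unhappy_employee = `rm -rf /`
END

print "$config->{owner}\n";              # Peter Palfrader
print "$config->{unhappy_employee}\n";   # the literal text, never run
```

A malformed line stops the loader with an error message instead of breaking compilation, and the backticks are stored as plain text. In practice an existing format with a ready-made parser (JSON, YAML, INI) is usually a better choice than a hand-rolled one.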
It's a bad idea to put configuration data in compiled code, because it can't be easily changed by the user. For scripts, just make sure it's separated entirely from the rest and document it nicely.
A reason I'm surprised no one mentioned yet is testing. When config is in the code you have to write crazy, contorted tests to be able to test safely. You can end up writing tests that duplicate the code they test which makes the tests nearly useless; mostly just testing themselves, likely to drift, and difficult to maintain.
Hand in hand with testing is deployment which was mentioned. When something is easy to test, it is going to be easy (well, easier) to deploy.
The main issue here is reusability in an environment where multiple languages are possible. If your config file is in language A, then you want to share this configuration with language B, you will have to do some rewriting.
This is even more complicated if you have more complex configurations (for example, the Apache config files) and are trying to figure out how to handle potential differences in data structures. If you use something like JSON or YAML, the parsers in each language know how to map the data onto that language's data structures.
The one major drawback of not keeping the configuration in a programming language is that you lose the ability to set config values from dynamic data.
I agree with Tim Anderson. Somebody here confuses configuration in code with configuration not being configurable. This is correct for compiled code.
A Perl or Ruby file is read and interpreted, just as a YAML or XML file with configuration data is. I choose YAML because it is easier on the eye than code, and because it groups settings by environment (test, development, staging and production), which in code would involve more .. code.
As a side note, XML contradicts the "easy on the eye" completely. I find it interesting that XML config is extensively used with compiled languages.
Reason 1. Aesthetics. While no one gets harmed by bad smell, people tend to put effort into getting rid of it.
Reason 2. Operational cost. For a team of 5 this is probably ok, but once you have developer/sysadmin separation, you must hire sysadmins who understand Perl (which is $$$), or give developers access to production system (big $$$).
And to make matters worse you won't have time (also $$$) to introduce a configuration engine when you suddenly need it.
My main problem with configuration in the many small scripts I write is that they often contain login data (username and password, or an auth token) for a service I use. Then later, when the script gets bigger, I start versioning it and want to upload it to GitHub.
So before every commit I need to replace my configuration with some dummy values.
$CONFIG{'user'} = 'username';
$CONFIG{'password'} = '123456';
You also have to be careful that those values have not slipped into your commit history at some point. This can get very annoying. Once you have gone through this one or two times, you will never again try to put configuration into code.
Excuse the long code listing. Below is a handy Conf.pm module that I have used in many systems, which allows you to specify different variables for production, staging and dev environments. Then I build my programs to either accept the environment parameters on the command line, or I store this file outside of the source control tree so that it never gets overwritten.
The AUTOLOAD provides automatic methods for variable retrieval.
# Instructions:
# use Conf;
# my $c = Conf->new("production");
# print $c->root_dir;
# print $c->log_dir;
package Conf;
use strict;
our $AUTOLOAD;
my $default_environment = "production";
my @valid_environments = qw(
development
production
);
#######################################################################################
# You might need to change this.
sub set_vars {
my ($self) = @_;
$self->{"access_token"} = 'asdafsifhefh';
if ( $self->env eq "development" ) {
$self->{"root_dir"} = "/Users/patrickcollins/Documents/workspace/SysG_perl";
$self->{"server_base"} = "http://localhost:3000";
}
elsif ($self->env eq "production" ) {
$self->{"root_dir"} = "/mnt/SysG-production/current/lib";
$self->{"server_base"} = "http://api.SysG.com";
$self->{"log_dir"} = "/mnt/SysG-production/current/log";
} else {
die "No environment defined\n";
}
#######################################################################################
# You shouldn't need to configure this.
# More dirs. Move these into the dev/prod sections if they're different per env.
my $r = $self->{'root_dir'};
my $b = $self->{'server_base'};
$self->{"working_dir"} ||= "$r/working";
$self->{"bin_dir"} ||= "$r/bin";
$self->{"log_dir"} ||= "$r/log";
# Other URLs. Move these into the dev/prod sections if they're different per env.
$self->{"new_contract_url"} = "$b/SysG-training-center/v1/contract/new";
$self->{"new_documents_url"} = "$b/SysG-training-center/v1/documents/new";
}
#######################################################################################
# Code, don't change below here.
sub new {
my ($class,$env) = @_;
my $self = {};
bless ($self,$class);
if ($env) {
$self->env($env);
} else {
$self->env($default_environment);
}
$self->set_vars;
return $self;
}
sub AUTOLOAD {
my ($self,$val) = @_;
my $type = ref ($self) || die "$self is not an object";
my $field = $AUTOLOAD;
$field =~ s/.*://;
#print "field: $field\n";
unless (exists $self->{$field} || $field =~ /DESTROY/ )
{
die "ERROR: {$field} does not exist in object/class $type\n";
}
$self->{$field} = $val if ($val);
return $self->{$field};
}
sub env {
my ($self,$in) = @_;
if ($in) {
# Compare each valid name with $in; grep($in, ...) would accept any true value
die ("Invalid environment $in") unless (grep { $_ eq $in } @valid_environments);
$self->{"_env"} = $in;
}
return $self->{"_env"};
}
1;

What's the best method to generate Multi-Page PDFs with Perl and PDF::API2?

I have been using the PDF::API2 module to generate a PDF. I work at a warehousing company and we are trying to switch from text packing slips to PDF packing slips. A packing slip lists the items needed for a single order. Currently my program generates a single-page PDF, and that works fine. But now I realize that the PDF will need to span multiple pages if there are more than 30 items in an order. I was trying to think of an easy(ish) way to do that, but couldn't come up with one. The only thing I could think of involves creating another page and having logic that redefines the coordinates of the line items if there are multiple pages. So I was trying to see if there was a different method or something I was missing that could help, but I wasn't really finding anything on CPAN.
Basically, I need to create a single-page PDF unless there are more than 30 items, in which case it needs to be multiple pages.
I hope that made sense and any help at all would be greatly appreciated as I am relatively new to programming.
Since you already have the code working for one-page PDFs, changing it to work for multi-page PDFs shouldn't be too hard.
Try something like this:
use PDF::API2;
sub create_packing_list_pdf {
my @items = @_;
my $pdf = PDF::API2->new();
my $page = _add_pdf_page($pdf);
my $max_items_per_page = 30;
my $item_pos = 0;
while (my $item = shift(@items)) {
$item_pos++;
# Create a new page, if needed
if ($item_pos > $max_items_per_page) {
$page = _add_pdf_page($pdf);
$item_pos = 1;
}
# Add the item at the appropriate height for that position
# (you'll need to declare $base_height and $line_height)
my $y = $base_height - ($item_pos - 1) * $line_height;
# Your code to display the line here, using $y as needed
# to get the right coordinates
}
return $pdf;
}
sub _add_pdf_page {
my $pdf = shift();
my $page = $pdf->page();
# Your code to display the page template here.
#
# Note: You can use a different template for additional pages by
# looking at e.g. $pdf->pages(), which returns the page count.
#
# If you need to include a "Page 1 of 2", you can pass the total
# number of pages in as an argument, rounding up so that exactly
# 30 items still count as one page:
# int((scalar(@items) + $max_items_per_page - 1) / $max_items_per_page)
return $page;
}
The main thing is to split up the page template from the line items so you can easily start a new page without having to duplicate code.
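The paging arithmetic can be tried out on its own, without any PDF code. This is a minimal sketch assuming 30 items per page as in the question; the helper name paginate is made up for the example, and the items are placeholder strings.

```perl
use strict;
use warnings;
use POSIX qw(ceil);

# Split a list of items into pages of at most $per_page items each.
sub paginate {
    my ($per_page, @items) = @_;
    my @pages;
    push @pages, [ splice @items, 0, $per_page ] while @items;
    return @pages;
}

my @items = map { "item $_" } 1 .. 65;
my @pages = paginate(30, @items);

# The page count is also available up front, e.g. for "Page 1 of 3".
my $total_pages = ceil(@items / 30);

for my $n (0 .. $#pages) {
    printf "Page %d of %d: %d items\n",
        $n + 1, $total_pages, scalar @{ $pages[$n] };
}
```

With the items grouped like this, the drawing code can loop over pages, call the page template once per page, and compute each line's y coordinate from its position within that page.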
PDF::API2 is low-level. It doesn't have most of what you would consider necessary for a document, things like margins, blocks, and paragraphs. Because of this, I'm afraid you're going to have to do things the hard way. You may want to look at PDF::API2::Simple. It might meet your criteria and it's simple to use.
I use PDF::FromHTML for some similar work. Seems to be a reasonable choice, I guess I am not too big on positioning by hand.
The simplest method is to use PDF::API2::Simple:
use PDF::API2::Simple;
my @content;    # the lines of text to print
my $pdf = PDF::API2::Simple->new(file => $name);    # $name is the output file name
$pdf->add_font('Courier');
$pdf->add_page();
foreach my $line (@content)
{
    $pdf->text($line, autoflow => 'on');
}
$pdf->save();

How can I write a Perl script to automatically take screenshots?

I want a platform independent utility to take screenshots (not just within the browser).
The utility would be able to take screenshots after fixed intervals of time and be easily configurable by the user in terms of
time between successive shots,
the format in which the shots are stored,
when the script should stop (at a given time or on an event), etc.
Since I need platform independence, I think Perl is a good choice.
a. Before I start out, I want to know whether a similar thing already exists, so I can start from there?
Searching CPAN gives me these two relevant results :
Imager-Screenshot-0.009
Imager-Search-1.00
From those pages, the first one looks easier.
b. Which one of these Perl modules should I use?
Taking a look at the sources of both, Imager::Search isn't much more than a wrapper to Imager::Screenshot.
Here's the constructor:
sub new {
my $class = shift;
my @params = ();
@params = @{shift()} if _ARRAY0($_[0]);
my $image = Imager::Screenshot::screenshot( @params );
unless ( _INSTANCE($image, 'Imager') ) {
Carp::croak('Failed to capture screenshot');
}
# Hand off to the parent class
return $class->SUPER::new( image => $image, @_ );
}
Given that Imager::Search does not really extend Imager::Screenshot much more, I'd say you're looking at two modules that are essentially the same.

Is there a tool for extracting all variable, module, and function names from a Perl module file?

My apologies if this is a duplicate; I may not know the proper terms to search for.
I am tasked with analyzing a Perl module file (.pm) that is a fragment of a larger application. Is there a tool, app, or script that will simply go through the code and pull out all the variable names, module names, and function calls? Even better would be something that would identify whether it was declared within this file or is something external.
Does such a tool exist? I only get the one file, so this isn't something I can execute -- just some basic static analysis I guess.
Check out the new, but well recommended Class::Sniff.
From the docs:
use Class::Sniff;
my $sniff = Class::Sniff->new({class => 'Some::class'});
my $num_methods = $sniff->methods;
my $num_classes = $sniff->classes;
my @methods = $sniff->methods;
my @classes = $sniff->classes;
{
my $graph = $sniff->graph; # Graph::Easy
my $graphviz = $graph->as_graphviz();
open my $DOT, '|dot -Tpng -o graph.png' or die("Cannot open pipe to dot: $!");
print $DOT $graphviz;
}
print $sniff->to_string;
my @unreachable = $sniff->unreachable;
foreach my $method (@unreachable) {
print "$method\n";
}
This will get you most of the way there. Some variables, depending on scope, may not be available.
If I understand correctly, you are looking for a tool to go through Perl source code. I am going to suggest PPI.
Here is an example cobbled up from the docs:
#!/usr/bin/perl
use strict;
use warnings;
use PPI::Document;
use HTML::Template;
my $Module = PPI::Document->new( $INC{'HTML/Template.pm'} );
my $sub_nodes = $Module->find(
sub { $_[1]->isa('PPI::Statement::Sub') and $_[1]->name }
);
my @sub_names = map { $_->name } @$sub_nodes;
use Data::Dumper;
print Dumper \@sub_names;
Note that this will output:
...
'new',
'new',
'new',
'output',
'new',
'new',
'new',
'new',
'new',
...
because multiple classes are defined in HTML/Template.pm. Clearly, a less naive approach would work with the PDOM tree in a hierarchical way.
Another CPAN tool available is Class::Inspector:
use Class::Inspector;
# Is a class installed and/or loaded
Class::Inspector->installed( 'Foo::Class' );
Class::Inspector->loaded( 'Foo::Class' );
# Filename related information
Class::Inspector->filename( 'Foo::Class' );
Class::Inspector->resolved_filename( 'Foo::Class' );
# Get subroutine related information
Class::Inspector->functions( 'Foo::Class' );
Class::Inspector->function_refs( 'Foo::Class' );
Class::Inspector->function_exists( 'Foo::Class', 'bar' );
Class::Inspector->methods( 'Foo::Class', 'full', 'public' );
# Find all loaded subclasses or something
Class::Inspector->subclasses( 'Foo::Class' );
This will give you similar results to Class::Sniff; you may still have to do some processing on your own.
There are better answers to this question, but they aren't getting posted, so I'll claim the fastest gun in the West and go ahead and post a 'quick-fix'.
Such a tool exists, in fact, and is built into Perl. You can access the symbol table for any namespace by using a special hash variable. To access the main namespace (the default one):
for(keys %main::) { # alternatively %::
print "$_\n";
}
If your package is named My/Package.pm, and is thus in the namespace My::Package, you would change %main:: to %My::Package:: to achieve the same effect. See the perldoc perlmod entry on symbol tables - they explain it, and they list a few alternatives that may be better, or at least get you started on finding the right module for the job (that's the Perl motto - There's More Than One Module To Do It).
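As a small illustration of walking a symbol table for one package, the sketch below lists only the entries that have a subroutine defined. My::Package and its two subs are hypothetical, defined inline so the example runs on its own, and subs_in_package is a made-up helper name. Note that this inspects loaded code, so it is not static analysis.

```perl
use strict;
use warnings;

{
    package My::Package;           # hypothetical package for the demo
    our $VERSION = '0.01';
    sub alpha { 1 }
    sub beta  { 2 }
}

# List the names in a package's symbol table that have a code slot,
# i.e. the subroutines defined in (or imported into) that package.
sub subs_in_package {
    my ($package) = @_;
    no strict 'refs';
    my $table = \%{"${package}::"};
    my @names = grep { defined &{"${package}::$_"} } keys %$table;
    return sort @names;
}

print "$_\n" for subs_in_package('My::Package');
```

Package variables such as $VERSION also live in the symbol table, but they are filtered out here because they have no code slot.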
If you want to do it without executing any of the code that you are analyzing, it's fairly easy to do this with PPI. Check out my Module::Use::Extract; it's a short bit of code that shows you how to extract any sort of element you want from PPI's PerlDOM.
If you want to do it with code that you have already compiled, the other suggestions in the answers are better.
I found a pretty good answer to what I was looking for in this column by Randal Schwartz. He demonstrated using the B::Xref module to extract exactly the information I was looking for. Just replacing the evaluated one-liner he used with the module's filename worked like a champ, and apparently B::Xref comes with ActiveState Perl, so I didn't need any additional modules.
perl -MO=Xref module.pm